Clean benchmark artifacts and freeze paper snapshot by MaxGhenis · Pull Request #5 · PolicyEngine/policybench

MaxGhenis · 2026-05-02T02:17:32Z

Summary

- Remove legacy v1/v2 result artifacts and generated paper scratch outputs from the tracked repo
- Add a frozen 2026-05-01 paper snapshot with hash/count tests
- Switch public CLI/docs wording to PolicyEngine reference outputs while keeping ground-truth as a compatibility alias
- Fix the full-run exporter/runbook to combine by-model chunked outputs correctly
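
The hash/count tests for the frozen snapshot are not shown in this description. As a rough sketch of the idea only, assuming a hypothetical manifest that records an expected SHA-256 digest and data-row count per snapshot file (the file names, manifest layout, and helper names below are illustrative and not taken from the repository):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def count_rows(path: Path) -> int:
    """Count data rows in a CSV-like text file, excluding the header line."""
    lines = path.read_text().splitlines()
    return max(len(lines) - 1, 0)


def check_snapshot(snapshot_dir: Path, manifest: dict) -> list[str]:
    """Compare each listed file against its recorded hash and row count.

    `manifest` is a hypothetical mapping of
    file name -> {"sha256": str, "rows": int}.
    Returns a list of mismatch messages; an empty list means the
    snapshot is unchanged.
    """
    problems = []
    for name, expected in manifest.items():
        path = snapshot_dir / name
        if not path.exists():
            problems.append(f"{name}: missing")
            continue
        if sha256_of(path) != expected["sha256"]:
            problems.append(f"{name}: hash changed")
        if count_rows(path) != expected["rows"]:
            problems.append(f"{name}: row count changed")
    return problems
```

A pytest test could then assert that `check_snapshot(...)` returns an empty list. Checking the row count as well as the hash yields a more readable failure when a file is truncated or extended rather than edited in place.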

Verification

uv run pytest -q
uv run ruff check .
uv run ruff format --check .
npm --prefix app run lint
npm --prefix app run build
uv run python paper/render_paper.py
git diff --check

No benchmark LLM responses were regenerated.


Replace placeholder components with 4 production views:
- ScatterPlot: predicted vs actual with condition toggle
- ModelLeaderboard: ranked model comparison table
- ProgramHeatmap: variable x model accuracy grid
- ScenarioExplorer: per-household drill-down with all predictions

Remove old unused components (ModelComparison, ProgramBreakdown, ExampleScenarios) and mock data file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Replace two-column hero layout with single-column compact header
- Swap bordered stat cards for inline stat bar
- Add PE mark icon to hero and sticky nav brand
- Remove redundant sidebar (top models, preprint card)
- Remove redundant CTAs (View leaderboard, Explore households)
- Exclude public/paper/ from ESLint, add img-element directives

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Exclude generated notebook from ruff via extend-exclude
- Fix F402 shadow warnings in scenarios.py (rename loop vars)
- Wrap long strings/lines for E501 across all Python files
- Run ruff format for consistent style

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
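
For context, Ruff's F402 rule (inherited from Pyflakes) flags a loop variable that reuses the name of an import, which silently rebinds the imported name. The snippet below is a generic illustration with made-up names; it is not the actual code from scenarios.py.

```python
from collections import Counter

# Flagged by F402 (kept as a comment so this example stays lint-clean):
#
#     for Counter in groups:   # rebinds the imported name `Counter`
#         ...
#
# After such a loop, `Counter` would refer to the last group, not the class.


def tally(groups):
    """Count items across groups using a loop variable that shadows nothing."""
    totals = Counter()
    for group in groups:  # renamed loop variable; `Counter` is still the import
        totals.update(group)
    return totals
```

Renaming the loop variable, as the commit describes, is the usual fix because it leaves the import usable later in the module.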

- Merge hero and sticky nav into single component that collapses on scroll
- Expanded: full title, "a [PE logo] project" tagline, subtitle, stats
- Collapsed: compact bar with nav tabs, view selector, Paper link
- Smooth CSS transitions between states
- Zero duplication: one header, two modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Replace binary scrolled toggle with continuous scroll progress (0→1)
- All header properties interpolate smoothly: title size, padding, opacity, background, nav visibility
- Change tagline from "a PE project" to "by [PE logo]"
- Uses rAF-throttled scroll listener for 60fps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wrap long help strings in cli.py and split long assertion lines in tests. Also wraps two pre-existing E501 violations in analysis.py and test_analysis.py.

All 133 tests pass. ruff check + ruff format --check both clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
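
One common way to wrap a long help string for E501 without changing its text is implicit concatenation of adjacent string literals inside parentheses. The sketch below uses argparse from the standard library purely for illustration; the CLI framework used by cli.py is not stated here, and the program name, option, and help text are invented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser whose help text is split across source lines."""
    parser = argparse.ArgumentParser(prog="example-cli")
    parser.add_argument(
        "--output-dir",
        default="results",
        # Adjacent string literals are joined at compile time, so the help
        # text is unchanged while each source line stays within the limit.
        help=(
            "Directory where result files are written. "
            "Created if it does not already exist."
        ),
    )
    return parser
```

Note the trailing space inside the first literal: the pieces are joined exactly as written, so a missing space would run the two sentences together.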

vercel · 2026-05-02T02:17:37Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policybench	Ready	Preview, Comment	May 2, 2026 2:17am

MaxGhenis and others added 30 commits February 25, 2026 17:49

fixup! fixup! Add Claude Sonnet 4.6 to benchmark — now best no-tools …

c985d89

…model at 72.3%

Migrate dashboard to Next.js and refresh benchmark app

6cd9bce

Track dollar formatting helper

ef2941f

Refine policybench benchmark pipeline

7c2ae88

Repair partial batch answers

f508135

Increase Gemini Pro batch output budget

d9aa9c9

Adopt bounded benchmark scoring

fcd73aa

Expand benchmark app, diagnostics, and paper publishing

aca14f4

Refresh rendered paper assets

205c01b

Fix paper PDF render pipeline

6e40c72

Remove leaderboard leader callout

095063d

Tighten benchmark artifacts and analysis

6888557

Fix diagnostics explanation tooltip

5367657

Track main as the canonical CI branch

28de8e7

Preserve pre-v2 main history on main

9504a62

Use saved scenario manifests and guard benchmark resumes

6b4dbb2

Make benchmark answer extraction strict

7657e6a

Rebuild benchmark output contract

d245687

Expand headline scope to net income components

c88599c

Add prompt mode comparison script

1693950

Migrate PolicyBench runtime to policyengine.py

44e7a03

Support UK employment income leaf inputs

4273c41

Filter conditional prompt inputs

f5c3668

Add provider request wall timeout

cadcd60

MaxGhenis added 14 commits April 27, 2026 07:11

Omit aggregate net worth from prompts

27d329c

Clarify benchmark prompt contract

15fa09e

Rename UK transfer scenario source

913ed76

Suppress prior self-employment sentinel input

e2c9daa

Omit prior-year inputs from prompts

d46daf4

Remove noisy prompt inputs

cf48728

Prefer current spec prompt descriptions

9d3a88e

Remove v1 benchmark code

4b47716

Scale response budget for expanded outputs

a2d5287

Publish UK 100-household benchmark

9486494

Publish 100-household US and UK benchmark results

9b1a968

Keep DeepSeek experimental only

fef6d94

Tighten explanation consistency contract

59b17c7

Clean benchmark artifacts and freeze paper snapshot

db73c84

MaxGhenis merged commit 1f0f7cb into main May 2, 2026
4 checks passed

MaxGhenis deleted the codex/manifest-resume-guards branch May 2, 2026 02:19

MaxGhenis merged 44 commits into main from codex/manifest-resume-guards
